WarcCdxWriter: normalize URL of redirect target location#34

Merged
sebastian-nagel merged 1 commit into cc from warc-writer-normalize-redirects on Mar 15, 2025
@sebastian-nagel

Convert the redirect target location into an absolute URL and normalize the URL using the URL normalizer configured for scope "fetcher" before storing it as field "redirect" in the CDX file.

This PR fixes a regression introduced with b3b78bb: since that commit, redirect targets are no longer converted to absolute URLs. Making them absolute is not sufficient, though: normalization is also required. Otherwise, the redirect URLs in the URL index may include different representations of equivalent host names (upper/lower case, IDNs), or even different forms of the URL path and query. Only URL normalization makes it possible to reliably follow redirects in the URL index.
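To illustrate the two steps (resolve the redirect target against the URL of the fetched page, then normalize it), here is a minimal, self-contained sketch. It is not the code of this PR: the class and method names are hypothetical, and plain `java.net.URI` stands in for the URL normalizer configured for scope "fetcher". It only lowercases scheme and host, removes dot segments and default ports, and drops the fragment; IDN handling is left out.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration, not the WarcCdxWriter code of this PR.
class RedirectTargetSketch {

  /** Resolve a redirect location (possibly relative) against the page URL. */
  static String toAbsolute(String baseUrl, String location) {
    return URI.create(baseUrl).resolve(URI.create(location.trim())).toString();
  }

  /** Very small stand-in for a real URL normalizer (assumption). */
  static String normalize(String url) {
    URI u = URI.create(url).normalize(); // removes "." and ".." segments
    if (u.getScheme() == null || u.getHost() == null) {
      return u.toString(); // not a plain server-based URL: leave as is
    }
    String scheme = u.getScheme().toLowerCase(Locale.ROOT);
    String host = u.getHost().toLowerCase(Locale.ROOT);
    int port = u.getPort();
    boolean defaultPort = port == -1
        || (scheme.equals("http") && port == 80)
        || (scheme.equals("https") && port == 443);
    String path = u.getRawPath();
    if (path == null || path.isEmpty()) {
      path = "/";
    }
    StringBuilder sb = new StringBuilder();
    sb.append(scheme).append("://").append(host);
    if (!defaultPort) {
      sb.append(':').append(port);
    }
    sb.append(path);
    if (u.getRawQuery() != null) {
      sb.append('?').append(u.getRawQuery());
    }
    // The fragment is intentionally dropped in this sketch.
    return sb.toString();
  }

  /** Value that would be stored as field "redirect" in this sketch. */
  static String redirectField(String pageUrl, String location) {
    return normalize(toAbsolute(pageUrl, location));
  }

  public static void main(String[] args) {
    System.out.println(
        redirectField("http://Example.COM/a/b.html", "../c/./d.html?x=1"));
  }
}
```

With a real normalizer chain in place of `normalize`, equivalent spellings of the same target end up as one string, which is what makes a lookup of the redirect target in the URL index reliable.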

Minor change: create all instances of SimpleDateFormat using the ROOT locale, and use the timezone "UTC" consistently.
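A minimal sketch of what consistent construction can look like. The helper class, its method names, and the 14-digit timestamp pattern are assumptions chosen for illustration, not taken from the changed code.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, for illustration only.
class UtcDateFormatSketch {

  /** Create a formatter that does not depend on the JVM's default locale or timezone. */
  static SimpleDateFormat utcFormat(String pattern) {
    SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    return format;
  }

  /** Format epoch milliseconds as a 14-digit timestamp (pattern is an assumption). */
  static String timestamp(long epochMillis) {
    return utcFormat("yyyyMMddHHmmss").format(new Date(epochMillis));
  }

  public static void main(String[] args) {
    System.out.println(timestamp(0L));
  }
}
```

Without the explicit locale and timezone, the output of a `SimpleDateFormat` depends on the defaults of the machine running the job, so the same capture time could be written differently on different hosts.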

Convert the redirect target location into an absolute URL
and normalize the URL using the URL normalizer configured
for scope "fetcher" before storing it as field "redirect"
in the CDX file.

Create all instances of SimpleDateFormat using the ROOT
locale, use timezone "UTC" consistently.
@sebastian-nagel sebastian-nagel merged commit ce36046 into cc Mar 15, 2025
@sebastian-nagel sebastian-nagel deleted the warc-writer-normalize-redirects branch March 15, 2025 12:47